
Oils Is Exterior-First (Code, Text, and Structured Data)

2023-06-20

I introduced a distinction in Narrow Waists Can Be Interior or Exterior:

This post uses the interior-exterior idea to describe Oils. For example, OSH and YSH are exterior-first, while PowerShell, Elvish, Nushell, and others are interior-first. To see this, we review three aspects of the design:

  1. Units of code. YSH functions are interior, while "procs" are both interior and exterior.
  2. Text. Oils uses UTF-8 strings both in memory and "on the wire". The concept of a Unicode encoding is an interior vs. exterior issue.
  3. Structured data has two closely related forms:
Table of Contents
Recap
Should Shells Have Two Tiers?
Survey of alternative shells
Exterior designs are layered
YSH whipupitude
Code
Procs are Interior or Exterior
FAQ: proc main versus func main
Text
Survey: Programming languages disagree
UTF-8 is for the Interior, not just the Exterior
Possible APIs for YSH
Structured Data
Data structures follow Data languages, not vice versa
The JSON-Unix string mismatch
Personal Stories / History
The First JSON language I designed (2009)
My office-mate looks at JSON (2006)
Oils and JSON
Conclusion
YSH is Python-influenced, but still a shell
Let's fix shell

Recap

This is the third post in a series about YSH. I want it to be a "design roadmap" for contributors and for me, but I hope casual readers will also take something away.

  1. Reviewing YSH - 7 language features, 3 arguments for structured data, ...
  2. Sketches of YSH Features - 14 use cases for blocks, ...
  3. Oils Is Exterior-First. This post returns to our #software-architecture ideas to explain the design from another perspective.

I "forked" another post while writing this one: How to Create a UTF-16 Surrogate Pair by Hand, with Python. It started with an implementation detail, and led to a good discussion about Unicode history, e.g. Windows vs. Unix.

Should Shells Have Two Tiers?

Now let's see how the distinction helps us with the design of YSH. Last year, The Sketch of the Biggest Idea in Software Architecture asked:

Should shells have two tiers? Both external processes and internal "functions"?

Both pipelines of bytes and pipelines of structured data?

We now have answers. YSH will have both:

On the other hand, our pipelines are identical to those that Thompson's original shell pioneered: "real" OS processes communicating over channels created with pipe(). Structured data is layered on top, with textual data languages based on JSON and TSV.

Survey of alternative shells

This table summarizes my impressions of a few alternative shells (corrections are welcome):

Style     Shell       VM / Scheduler            What's in a pipeline?      What's Piped?
Interior  PowerShell  .NET VM                   cmdlet, a kind of class    .NET objects, instances of classes
Interior  Elvish      Goroutine scheduler       Function, Wrapped Process  Garbage-collected Go records, or JSON
Interior  Nushell     Rust async I/O scheduler  Builtins or Plugins        Rust/serde Objects, JSON/msgpack
Exterior  Oils        Unix Kernel               procs or processes         Bytes ⊃ Text ⊃ Data Languages

So it appears that most alternative shells are interior-first, but Oils is exterior-first.

The distinction isn't black and white: All shells have both facilities (even bash), so it's more a matter of what's "primary" in the design. It's also a matter of how awkward the interface is — do you have two different "worlds" or tiers to bridge?

Nonetheless, we'll say an interior-first shell favors code that lives within a process, while an exterior-first shell favors coordinating data between processes.

Exterior designs are layered

Notice the layering:

  1. Bourne shell starts with bytes from the kernel, and layers conventions on top.
  2. One convention is the line-based structure that grep, awk, and sed use, which requires understanding ASCII '\n'.
  3. UTF-8 text is layered on top of ASCII, in a compatible way.
  4. Oils layers structured data languages on top of UTF-8 text.

I would draw this as:

Bytes   ⊃   ASCII   ⊃   UTF-8   ⊃   Data Languages

I'd also say the exterior style is one level below interior shells, which preserves shell's role as universal glue. If you want to glue together a .NET VM and a Go process, or a Clojure program and an R script, your lowest common denominator is probably a pipe, socket, or bash script.

YSH whipupitude

So does YSH have two tiers? Despite having both proc and func, I'm trying to avoid two tiers, at least to the extent that it reduces the whipupitude of shell.

It's "wrong" to think about YSH programs in this Python-like way:

  1. First you manipulate garbage-collected data structures in memory
  2. Then you serialize them to some format.

The "right way" is to program directly with text, including our data languages for strings, records and tables. They are designed to eliminate ad-hoc parsing, which is the main downside of text.

Our in-memory data structures map one-to-one with text, and are in service of text. The encode() and decode() operations on J8 strings are perfect inverses, for arbitrary byte strings.

More details below.

Code

Procs are Interior or Exterior

Why did I say procs are either interior or transparently exterior? Because that's how Bourne shell works, and it's powerful and underused. The simplest usage of a proc occurs in a single process, making it interior:

myproc() {
  cp *.py /tmp
  echo done
}
myproc  # interior call

But you have at least two ways of making procs exterior:

(Implementation status: procs exist in YSH, but we still need to implement functions.)

FAQ: proc main versus func main

This is a good time to answer a great question from Mastodon. I expect it to be common, so I'll paraphrase:

Now that YSH has functions, can we just ignore procs? Start with func main, and call other functions with typed data?

I don't want to dictate the way people write code, but I think there are some downsides:

The advantages of procs will probably become clearer when actually writing code. I should write more #shell-the-good-parts posts with concrete examples, but until then you can see them all over the Oil repo.

Text

Now that we've discussed interior and exterior code, let's discuss text. It's central to not just shell, but all programming languages.

Survey: Programming languages disagree

Text is also complex and controversial. This article, linked in the appendix of the surrogate pair post, shows that languages disagree on the length operation:

Programming Languages  Length of 🤦🏼‍♂️  Unit Counted
Go, Rust, Python 2     17            UTF-8 code units, aka bytes
JavaScript, Java       7             UTF-16 code units
Python 3, bash         5             UTF-32 code units, aka code points
Swift                  1             Extended grapheme cluster, which doesn't have a fixed definition
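The first three rows can be reproduced in Python 3 by encoding the same string three ways. This is a minimal sketch: the variable names are mine, and the emoji is spelled out as its five code points so the source stays ASCII. The Swift row needs grapheme cluster segmentation, which isn't in the Python standard library, so it's left out.

```python
# The facepalm emoji is five code points: face palm, skin-tone modifier,
# zero-width joiner, male sign, and variation selector-16.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

utf8_units = len(s.encode("utf-8"))            # bytes: what Go, Rust, Python 2 count
utf16_units = len(s.encode("utf-16-le")) // 2  # 16-bit units: what JavaScript, Java count
code_points = len(s)                           # code points: what Python 3 counts

print(utf8_units, utf16_units, code_points)    # prints: 17 7 5
```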

The surrogate pair post also sketches the history of this divergence, which is basically a Unix vs. Windows problem. Languages tend to follow operating systems, so JavaScript, Python, and JSON were dragged along for the 30-year ride.

UTF-8 is for the Interior, not just the Exterior

The length issue correlates with — but isn't identical to — another controversial issue: the representation of strings in memory. That is, the interior representation.

Oils follows the Go language, using an array of bytes, which may or may not be UTF-8 encoded strings:

Contrary to popular belief, and contrary to Python, C, and C++, UTF-8 is a great interior representation. It's naturally compressed in memory, and you can search for ASCII substrings like { or // within it, without decoding.
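Here's a small Python sketch of the "search without decoding" property, using bytes to stand in for an Oils string. It relies on a fact about UTF-8: every byte of a multi-byte sequence has its high bit set, so an ASCII pattern can never match in the middle of a non-ASCII character. The sample text is made up for illustration.

```python
# UTF-8 bytes containing both ASCII punctuation and multi-byte characters
data = 'naïve → {"k": 1} // 日本語'.encode("utf-8")

# Plain byte search; no decoding step is needed
brace = data.find(b"{")
comment = data.find(b"//")

# Why this is safe: no byte of a multi-byte sequence is in the ASCII range
non_ascii = "ï→日本語"
high_bit_only = all(b >= 0x80 for ch in non_ascii for b in ch.encode("utf-8"))

print(data[brace:brace + 1], data[comment:comment + 2], high_bit_only)
```

The results are byte offsets, which is exactly what you want for slicing the same byte string afterward.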

At some point, I may write Four Reasons New Programming Languages Should Adopt a UTF-8 Centric Design:

  1. New languages use UTF-8 internally: Go, Julia, Rust, Swift, Elixir, ...
  2. Older languages are moving toward UTF-8: Python 3, Ruby, Java
  3. Windows is taking steps toward UTF-8, starting in 2019

Those 3 reasons should be enough. If not, PyPy showed us in 2019 how to use UTF-8 internally, while still retaining O(1) random code point access. You probably don't need this operation, but if you really do, it can be made both time- and space-efficient.

Important: even though Oils is UTF-8 centric, it works with languages that use any string representation. The post above would explain why we're diverging from bash.

I would also mention a bug I found in 2018: bash's ${#s}, which measures length in code points, is a non-monotonic function of bytes. That is, adding a byte on the end of a string can reduce its length! This happens because bash doesn't handle invalid UTF-8 properly.

Possible APIs for YSH

Still, I recognize there is a tremendous amount of confusion around strings and UTF-8. We could make our APIs more explicit:

Instead of len(s), we could have

s->numBytes()     # O(1)
s->countRunes()   # O(n), may raise decode error

Decoding:

$ var runes = s->toRunes()
$ write (runes)
[65, 20, 66, ... ]

$ var s2 = Str.fromRunes(runes)  # not a method?
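For comparison, here's roughly how the length and decoding operations look in Python 3, with bytes playing the role of a YSH string. The function names mirror the hypothetical methods above; they aren't real YSH or Python APIs, just a sketch of the semantics I have in mind.

```python
from typing import List

def num_bytes(s: bytes) -> int:
    """O(1): the length is stored with the byte string."""
    return len(s)

def to_runes(s: bytes) -> List[int]:
    """O(n): decode UTF-8 into code points; raises UnicodeDecodeError if invalid."""
    return [ord(ch) for ch in s.decode("utf-8")]

def count_runes(s: bytes) -> int:
    """O(n): must decode to count, so it can fail on invalid UTF-8."""
    return len(s.decode("utf-8"))

def from_runes(runes: List[int]) -> bytes:
    """Inverse of to_runes() for valid code points."""
    return "".join(chr(r) for r in runes).encode("utf-8")

s = "A B μ".encode("utf-8")
print(num_bytes(s), count_runes(s))  # prints: 6 5
print(to_runes(s))                   # prints: [65, 32, 66, 32, 956]
```

Note the asymmetry: counting bytes always succeeds, while anything involving runes can fail, because an arbitrary byte string may not be valid UTF-8.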

Indexing:

s->byteAt(i)      # O(1)
s->findCharAt(i)  # NOT useful, use toRunes() instead

Or maybe indexing should be s[i] because there's only one O(1) operation. Same question with slicing:

s[i:j]              # O(1)
s->byteSlice(i, j)  # O(1), is this better?

Iteration:

for byte in (s) {  # Go iterates over runes, not bytes
  write (byte)
}

for rune in (s->toRunes()) {
  write (rune)
}

Substring search:

var i = s->find('//')  # remember this works without decoding

Regex:

# replace a byte or rune?
var result = s->replace( / <dot> %end /, ^"$1" ) 

This is just an idea. Right now we have len(s) giving the number of bytes.

Either way, the point is that strings in Oils follow exterior reality. They're arbitrary byte strings that may or may not be UTF-8 encoded. Bash strings are similar in that they don't have to be valid Unicode, but because they're NUL-terminated, they can't contain a NUL byte. UTF-8 is not present in the Unix kernel; it's layered on top.

Structured Data

Now that we've talked about text, let's talk about structured data. Remember that it's layered on top of text, and that it's a big YSH feature:

Shell Should Be More Like Python, JavaScript, and Ruby

Data structures follow Data languages, not vice versa

This is another way of saying that our data model is designed to be serialized, rather than serialization being an afterthought. In the intro, I said that:


What interior data structures will YSH have? To follow our exterior languages, I've decided on the following data model:

This can be described as either:

The idea is that interior structures and exterior languages map one-to-one, to the degree possible. I think having both ints and floats is important, because both JavaScript and Lua originally had a single number type, and grew proper integers after real usage.

So by choosing data structures to be in service of data languages, YSH is exterior-first.

The JSON-Unix string mismatch

But there are more one-to-one mapping problems. Here's the biggest one: JSON strings don't correspond to Unix strings.
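To make the mismatch concrete, here's a Python sketch of my reading of it. It goes in both directions: some Unix strings aren't valid UTF-8, so they have no JSON string representation, and some JSON strings (like a lone surrogate half) can't be encoded as UTF-8. The filename is a made-up example.

```python
import json

# 1. A legal Unix filename that is NOT valid UTF-8, so JSON can't carry it as text
unix_name = b"\xff.txt"
try:
    unix_name.decode("utf-8")
    unix_is_text = True
except UnicodeDecodeError:
    unix_is_text = False

# 2. A legal JSON string that is NOT encodable as UTF-8: a lone surrogate half
json_str = json.loads('"\\ud83e"')
try:
    json_str.encode("utf-8")
    json_is_utf8 = True
except UnicodeEncodeError:
    json_is_utf8 = False

print(unix_is_text, json_is_utf8)  # prints: False False
```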

This is a fairly technical issue, so I "forked" another post from this one:

I mentioned another demo in that post:

See blog-code/j8-notation if you want a preview.


How do we fix this? As mentioned in Sketches of YSH Features, we're adding \yff and \u{123456} to JSON strings, and calling those "J8 strings". This is the basis of "J8 Notation".

Mathematically:

Practically speaking, these properties make it easier to write correct shell programs. You can use J8 strings instead of ad-hoc parsing with spaces, newlines, or commas.
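Here's a toy Python sketch of the round-trip idea. It is NOT the real J8 implementation or spec: only the \yff and \u{...} escapes come from the description above, and the rest (which characters get escaped, lowercase hex, the function names) are assumptions made for illustration. It leans on Python's surrogateescape error handler to find the bytes that aren't valid UTF-8.

```python
import re

def j8_encode(data: bytes) -> str:
    """Encode ARBITRARY bytes as a double-quoted string (toy version)."""
    out = []
    # surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("\\y%02x" % (cp - 0xDC00))  # raw byte escape, e.g. \yff
        elif ch in '"\\':
            out.append("\\" + ch)                  # escape quote and backslash
        elif cp < 0x20:
            out.append("\\u{%x}" % cp)             # control characters
        else:
            out.append(ch)                         # valid UTF-8 passes through
    return '"' + "".join(out) + '"'

_ESCAPE = re.compile(r'\\(?:y([0-9a-f]{2})|u\{([0-9a-f]{1,6})\}|(["\\]))')

def j8_decode(text: str) -> bytes:
    """Inverse of j8_encode()."""
    if len(text) < 2 or text[0] != '"' or text[-1] != '"':
        raise ValueError("expected a double-quoted string")
    body = text[1:-1]
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(body):
        out += body[pos:m.start()].encode("utf-8")  # literal run before the escape
        if m.group(1):
            out.append(int(m.group(1), 16))         # \yXX is a single raw byte
        elif m.group(2):
            out += chr(int(m.group(2), 16)).encode("utf-8")
        else:
            out += m.group(3).encode("utf-8")
        pos = m.end()
    out += body[pos:].encode("utf-8")
    return bytes(out)

print(j8_encode(b"caf\xc3\xa9 \xff"))  # prints: "café \yff"
```

The property that matters is that j8_decode(j8_encode(b)) == b for every byte string b, including ones that aren't valid UTF-8.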

Personal Stories / History

We reviewed code, text, and structured data, which showed how YSH favors the exterior viewpoint.

This is because it's meant to compose seamlessly with processes not written in shell!

In other words, Oils is not a closed world. It's part of an operating system, and part of distributed systems. Again, shell is a language that grows: Unix Shell: Philosophy, Design, and FAQs.

The First JSON language I designed (2009)

My "JSON Template" project from 2009 is relevant to the exterior-first philosophy. It's a string templating language that puts serialized data first — hence the name.

It no longer has an official repo because it was hosted on Google Code, which is now defunct. Ironically, I wrote it when I worked on Google Code itself!

(It lives on in the Oil repo as test/jsontemplate.py. It's been part of the "Wild" test report for years, and I recently ported our Soil CI dashboard to it.)


Why did I create JSON Template? Google Code was written in Python and JavaScript, and I didn't like using 2 template languages: one on the server, and another on the client. (Remember how different the ecosystem was prior to 2009: node.js didn't yet exist.)

So I designed a data-driven template language, wrote an interpreter for it in Python, and ported the interpreter line-for-line to JavaScript.

For a ~1200 line program, it was surprisingly influential! It was the "version 1" of Go's text/template:

We were technically co-workers, but Rob and Russ actually just found the project on Reddit. It was exciting to get this validation from much more experienced engineers!

After Go 1.0, text/template was redesigned in a more imperative style. The JSON Template influence is still present in:

{{.}}               # "dot"
{{with X}} {{end}}  # push a scope, and conditionally execute

Those correspond to the primitives of JSON Template:

{@}                  # the "cursor"
{.section X} {.end}  # conditionally expand in a JSON namespace

Squarespace also started using it in 2010. I met the founder Anthony when they were a small company with a new office in Manhattan.

I thought they had moved off it, but I found this pretty recent YouTube video, which shows that it's still part of the Squarespace platform? I'd be interested if anyone understands how exactly it's used.


I bring this up to show that it's useful to think about serialized text first, in the Unix style. I don't think of JSON Template as a language for Python, JavaScript, or Go. It stands alone — floating in the cloud — and that requires using a language-independent representation like JSON.

My office-mate looks at JSON (2006)

I might as well drop another JSON story here: I introduced JSON to Python creator Guido van Rossum in 2006, although I'm not sure it led to anything consequential. JSON was added to the Python library a few years later via the library simplejson, which would have happened anyway.

Another one of my defunct Google Code projects was "chutils", which had a program called dice. It was basically JSON Lines or ndjson in 2006: a set of Unix utilities that communicated with JSON over pipes.

I used it to analyze logs from Google's internal dev tools. In particular, the "hist" operator avoided the ad-hoc parsing of sort | uniq -c | sort -n:

$ cat x.tar.gz | to-json-lines | hist cmd  # histogram by field name
 905 log
 405 commit
  89 rm
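The original tool is long gone, so here's a reconstruction of the idea in Python rather than the real code: read one JSON record per line, and count the values of a named field. The function name and the sample records are assumptions.

```python
import collections
import json

def hist(lines, field):
    """Count values of `field` across JSON Lines input, most common first."""
    counts = collections.Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if field in record:
            counts[record[field]] += 1
    return counts.most_common()

sample = [
    '{"cmd": "log", "user": "a"}',
    '{"cmd": "commit", "user": "b"}',
    '{"cmd": "log", "user": "b"}',
]
for value, n in hist(sample, "cmd"):
    print("%4d %s" % (n, value))
```

Because each record is parsed as JSON, a field value containing spaces or tabs can't corrupt the count, which is the failure mode of splitting lines by hand.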

Guido was my office-mate at the time, and I remember he was pleasantly surprised by this "cool" use of Unix. I then showed him https://json.org/, and he said, "That's just Python!"

I believe that's almost literally true: all JSON will successfully eval() in the Python interpreter, as long as you define null, true, false = None, True, False.
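Here's a quick check of that claim in Python 3, along with one case I'm aware of that shows why it's only "almost" true: a JSON surrogate pair decodes to one code point, while the same characters evaluated as a Python string literal give two surrogates. The sample document is made up, and of course you should never eval() untrusted input.

```python
import json

doc = '{"name": "x", "tags": [1, 2.5, null], "ok": true, "bad": false}'
env = {"null": None, "true": True, "false": False}

# Evaluating JSON as a Python expression gives the same data structure
same = eval(doc, env) == json.loads(doc)

# The "almost": JSON joins a surrogate pair, a Python string literal does not
pair = '"\\ud83e\\udd26"'
json_len = len(json.loads(pair))
eval_len = len(eval(pair, env))

print(same, json_len, eval_len)  # prints: True 1 2
```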

Remember, this was 2006, and JSON was quietly invented in 2001. GMail and Google Maps popularized "AJAX" in 2004 and 2005, which was nominally based on XML. Server-side JavaScript didn't exist (or it was a failed Netscape experiment most people were unaware of.)

(History question: Why do Python and JavaScript share nearly the same syntax for {} and [] container literals? Python appeared in 1990, and JavaScript in 1995. Did they have a common ancestor, or did Python influence JavaScript? I recall that Guido said Python didn't invent this syntax, but I don't know where it came from. C doesn't have it.)

Oils and JSON

So despite playing with JSON for ~15 years, why is Oils coming around to it now? Well, Oils has had rough JSON support since 2019:

(Hmm, re-reading this thread is interesting, I may comment on the issue of objects vs. data later — it's interior vs. exterior !!)

JSON-based data languages are becoming more central though. I'd say the main issues are:

JSON has a few other weaknesses beside the JSON-Unix String Mismatch:

Conclusion

To summarize:

It also applies to shell design: Oils is exterior-first.

YSH is Python-influenced, but still a shell

Now, what was the point of introducing PyObject*? It was an example of an interior narrow waist, but how does that relate to shell?

It explains the design: We won't have extensible data types like Python does! YSH is not for writing vector and matrix libraries :-)

In other words, the narrow waist of Oils is still exterior Unix files, not interior like PyObject.

It's a Python-like language, but it's still a shell. You program "directly" with text, which is now structured.

Let's fix shell

Let's end this post with another question: do these abstract ideas matter?

I think they will be our north star for a clean, focused, and bounded language design. Even though shell is a popular, fast-growing language in 2023, I frequently see comments like these, with many upvotes:

We really gotta stop writing and using software written in shell. There are so many footguns in shell that these types of mistakes are inevitable.

This tells me that the shell language has become so complex that many users have given up hope of ever writing it correctly. They don't even want to start learning it.

For this case, the problem was the difference between "$@" and eval "$@", which I mentioned in this issue.

But even that's confusing: the four characters "$@" look similar to "$x", but have wildly different semantics. And eval implicitly joins its arguments, which is even more confusing.


We have an opportunity to fix this with YSH. We're deep in the middle of it, with a lot left to do. But writing this series of posts has greatly clarified its design. We have a decision for essentially all design issues, although we'll certainly revise the language as we implement it.

I think we can produce something great!

Let me know what you think in the comments, which are now on Zulip.